Clusters are inferred from a set of individual multilocus genotypes by estimating genetic ancestry from unobserved source populations, leaving the inferred clusters to be used for assigning group membership to genotypes.
We will review statistical methods that ascertain population structure and estimate genetic ancestry without the use of predefined populations. Here we summarize 10 methods that synthesize concepts from:
The two general approaches:
EDA methods summarize multilocus genotype data sets in terms of a reduced number of components that can be used to interpret the data.
Many EDA methods apply to population and landscape genetics including, for example, ordination and multidimensional scaling techniques (McGarigal et al.2000).The most representative method used for exploratory data analysis is a dimension reduction technique called principal component analysis (PCA).
Principal component analysis (PCA):
Seeks orthogonal projections of multidimensional data, called principal axes or components (PCs), along which the data show the highest variance
The first principal component (PC1) is the projection with the largest variance.
Subsequent PCs represent summaries of the data, which are uncorrelated with the earlier PCs and account for the residual variance in the data
While the first two PCs are often used to explore the structure of variation in the sampled populations, it is generally useful to determine a number of significantly representative components (this can be achieved with Tracy–Widom testing procedures)
While EDA methods avoid formulating hypotheses about the biological processes that underlie the observed data, MBC methods incorporate population genetic models in their analysis.
MBC approaches are:
Computationally more intensive than EDA methods
Infer clusters of individuals based on multilocus genotypes, and assign individuals to the identified clusters.
Many clustering models compute individual ancestry coefficients, considering alleles as proportionally inherited from two or more parental populations.
MBC approaches are useful when:
MBC method in the program STRUCTURE:
Q, in which each element, qik, represents the proportion of individual i’s genome that originates from the parental population k.Bayesian MBC algorithms like STRUCTURE describe the joint probability distribution of the genotypic and ancestry coefficient matrices.
The joint probability distribution decomposes into the product of two terms:
The likelihood distribution
The prior distribution
A classic way of displaying PCA outputs is by using PC plots, which represent individual scores along the first and second principal components.
To illustrate the use of PCA and MBC algorithms, we performed some simulations of genotypic data under two classic models of discrete and continuous populations, and examined the results produced by these methods. Reference the above figure.
Geography is often reported to be a major determinant of population genetic variation in landscape genetic studies. Recent developments of EDA and MBC algorithms include geographic information in the inference of genetic clusters.
The goal of this category of methods is to:
Improve the assignment of individuals to genetic clusters
Account for isolation-by-distance (IBD) effects in statistical estimates of ancestry coefficients
EDA and MBC methods that consider spatial information reveal spatial genetic patterns more efficiently than methods that do not include this information. In this section, we describe landscape genetic EDA and MBC methods that make explicit use of geographic information.
Let us first review EDA methods that consider both geographic and genetic data.
For example, Jombart et al. (2008) used sPCA to discuss the identification of management units for conservation purposes in the endangered brown bear (Ursus arctos) (Waits et al. 2000; Manel et al. 2004). Their study detected differentiation among northern individuals along a gradient that correlates with the IBD gradients and found that the conservation units defined in previous studies corresponded to non-negative eigenvalues.
In Fig. 7.2, we applied PCA and spFA to a worldwide sample of genomic DNA from 388 individuals in 27 Asian populations. The dataset used a panel of 10,664 SNPs from the Harvard Human Genome Diversity Project (Frichot et al.2012).
Samples from Central Asia were represented with light colors, where as populations from East Asia were represented with dark colors.
A. The PC plot exhibited a horseshoe effect, revealing the existence of IBD patterns in the data. PCA described a continuum of populations without observable genetic discontinuities.
B. The spFA was applied to the data using a scale parameter that corrected for the effects of IBD in the first factors
The results provided evidence of a major discontinuity separating two human genetic clusters, one in Central Asia and one in East Asia. Uyghur and Hazara populations were projected in an intermediate position, suggesting shared ancestry from ancestral gene pools located in Central and East Asia.
The spatial MBC models rely on geographically explicit prior distributions and adopt the same likelihood framework as non-spatial MBC methods. The spatial MBC models are implemented in three main software packages:
GENELAND (Guillot et al. 2005)
TESS (Chen et al. 2007; Durand et al. 2009)
BAPS (Corander et al. 2008)
In this section, we survey EDA and MBC methods that include models of habitat and habitat suitability and that evaluate correlations between genetic variation and environmental gradients.
An important class of EDA methods, called constrained ordination, can be employed to incorporate the effects of environmental gradients into landscape genetics studies. The methods include:
For instance, Parisod and Christin (2008) used multivariate analyses to evaluate fine-scale ecological heterogeneity within continuous populations of Biscutella laevigata, a high mountain flowering plant. CCA identified the factors responsible for the partitioning of the genetic variance and revealed a composite effect of IBD, phenological divergence, and local adaptation to habitats characterized by different solar radiation regimes.
Bayesian methods that use predefined populations can identify environmental factors that correlate with population subdivision (Foll & Gaggiotti 2006).
Similarly, MBC algorithms were recently extended to include environmental variables. + MBC algorithms can evaluate population structure and can simultaneously infer ancestry coefficients based on associations with environmental variables. + The POPS model measures the effects of environmental variables on individual ancestry while correcting for geographic trends and spatial auto-correlation.
Based on geographic and environmental data, EDA and MBC algorithms can provide routines to project ancestry coefficients on geographic maps under scenarios of environmental change, leading to ancestry distribution models (ADM).
The ADM approach shares an analogy to species distribution models (SDM), which are statistical models correlating species abundance data to environmental predictors. + Analyzing data at a finer scale than SDMs, ADMs build projections of intraspecific genetic variation based on correlations between environmental variables and individual ancestry. + ADMs can be used to estimate changes in population genetic structure and the magnitudes of pole-ward displacement for areas of mixed ancestry separating pairs of genetic clusters + For example, forecasts from regression models can be done by varying climatic variables under various scenarios, considering temperatures globally increasing from 2 °Cto4°C in the next century
Likelihood: A quantity that describes the probability of the data in a particular statistical model conditional on the model parameters.
Exploratory Data Analysis (EDA): A set of descriptive methods for summarizing data using visual representations from computer packages without statistical models.
Model-based clustering (MBC): A set of approaches where the data are clustered using statistical mixture models.
Principal component analysis (PCA): Amethod of data reduction that replaces a set of observed variables, such as allele frequencies, that may be correlated among themselves by a set of synthetic variables, the PCA axes, that are orthogonal and perfectly uncorrelated with each other. The first PCAaxisisdefined to capture a maximum of the variation in the original data set, the second PCA axes a maximum of the remaining variance, and so forth.
Spatial autocorrelation: Measure of the degree to which individuals that are closer together in space are more genetically similar than those further apart.
Inherent spatial autocorrelation: Auto-correlation created by a biological process affecting the response (e.g., allele frequencies) directly, such as spatially restricted mating and dispersal
Induced spatial dependence: Spatial auto-correlation created by response to spatially structured landscape factors. If these factors are correctly included in the model, the autocorrelation will be removed from the residuals, but if some factors are missing or incorrectly specified, the residuals may show spatial autocorrelation.
Non-Stationary Spatial Assumption: Refers to violations of the assumption of stationarity (i.e., that the process that generated the pattern has the same intensity and variability over the entire study area) and can be detected when either the correlation between genetic data Y and landscape predictors X, or the mean, variance, and spatial autocorrelation in the residuals U are not constant across the study area.
Joint probability distribution: The corresponding probability distribution on all possible pairs of outputs
Constrained ordination: Uses a prior hypothesis to produce the ordination plot (i.e. they relate a matrix of response variables to explanatory variables). They only display the variation in the data of the explanatory variables (versus unconstrained which display all the variation in the data).
Base code provided by Sean Schoville
Picture of the Oregon threespine stickleback
The goals of this lab are to:
Assess how patterns of genetic variation can be used to delimit natural populations.
Compare methods that assess population structure.
Understand how population structure can be used to interpret biogeographic history.
All files are distributed as system files with the ‘LandGenCourse’ package (folder ‘extdata’).
Simulated data using the two-island model and admixture model. SNP data from Catchen et al. 2013 Catchen et al. 2013. The population structure and recent colonization history of Oregon threespine stickleback determined using restriction-site associated DNA-sequencing. Molecular Ecology 22:1365-294X. http://dx.doi.org/10.1111/mec.12330
Note: the function ‘library’ will always load the package, even if it is already loaded, whereas ‘require’ will only load it if it is not yet loaded. Either will work.
require(LandGenCourse)
## Loading required package: LandGenCourse
#require(LandGenCourseData)
#require(fields)
#require(RColorBrewer)
#require(maps)
#require(mapplots)
#require(here)
The package ‘LEA’ is available from the ‘Bioconductor’ repository. Also, package ‘fields’ was not installed with ‘LandGenCourse’ automatically due to compatibility issues.
if(!requireNamespace("fields", quietly = TRUE)) install.packages("fields", repos='http://cran.us.r-project.org')
if(!requireNamespace("LEA", quietly = TRUE)) {
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("LEA")
}
The data are in a data package:
if(!requireNamespace("LandGenCourseData", quietly = TRUE))
devtools::install_github("hhwagner1/LandGenCourseData")
We simulated data under a classic two-island model, using the coalescent simulation program ‘ms,’ developed by Richard Hudson. The program simulates a coalescent tree under various demographic models, and uses those trees to create allelic data. For those interested in using ‘ms,’ the source code is available here: http://home.uchicago.edu/rhudson1/source/mksamples.html
We simulated 200 haploid individuals genotyped at 100 loci. The effective mutation rate was ‘μ = 0.5.’ We sampled 2 islands with 100 individuals in each. The effective migration rate was ‘Nm 2’ (‘N’ is the effective size in each of the two island, ‘m’ is the bidirectional rate of gene flow).
Our ms command was as follows:ms 200 100 -t .5 -I 2 100 100 -ma x 2 2 x > dataNm2.txt
These raw data need to be converted in a format amenable to statistical analyses in R.
file <- scan(file = system.file("extdata", "dataNm2.txt", package = "LandGenCourse"),
what ="character", sep="\n", skip = 2)
genotype <- NULL
for(locus in 1:100){
res.locus <- file[4:203]
file <- file[-(1:203)]
genotype <- cbind(genotype, as.numeric(as.factor(res.locus)))}
dim(genotype)
## [1] 200 100
Now we have a new data file, genotype, loaded in R. This file contains 200 rows and 100 columns. Each row corresponds to a simulated individual. The columns code for their multi-locus genotypes.
Our first objective is to use ordination (Principal components analysis, or PCA) to examine population structure for the Nm = 2 data set.
The R command for PCA is fairly simple and fast:
pc = prcomp(genotype, scale =T)
In order to visualize how the first two eigenvectors capture genotype variation, we will color each population.
par(mar=c(4,4,0.5,0.5))
plot(pc$x, pch = 19, cex = 2, col = rep(c("blue2", "green"), each = 100))
Question 1:
Is population structure (genetic differentiation) evident?
Yes, this is made evident by the clustering of the two different “populations”
Let’s look at them in chart form:
summary(pc)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.77699 1.65283 1.5587 1.53452 1.48944 1.48496 1.46718
## Proportion of Variance 0.07712 0.02732 0.0243 0.02355 0.02218 0.02205 0.02153
## Cumulative Proportion 0.07712 0.10443 0.1287 0.15228 0.17446 0.19651 0.21804
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.43774 1.41957 1.40980 1.40041 1.38541 1.37335 1.35879
## Proportion of Variance 0.02067 0.02015 0.01988 0.01961 0.01919 0.01886 0.01846
## Cumulative Proportion 0.23871 0.25886 0.27874 0.29835 0.31754 0.33640 0.35487
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 1.35395 1.31441 1.3078 1.30115 1.29493 1.26846 1.26525
## Proportion of Variance 0.01833 0.01728 0.0171 0.01693 0.01677 0.01609 0.01601
## Cumulative Proportion 0.37320 0.39048 0.4076 0.42451 0.44128 0.45737 0.47338
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 1.23786 1.23212 1.22814 1.20358 1.18924 1.17923 1.17139
## Proportion of Variance 0.01532 0.01518 0.01508 0.01449 0.01414 0.01391 0.01372
## Cumulative Proportion 0.48870 0.50388 0.51896 0.53345 0.54759 0.56150 0.57522
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 1.16216 1.12304 1.10610 1.10157 1.09177 1.07816 1.05735
## Proportion of Variance 0.01351 0.01261 0.01223 0.01213 0.01192 0.01162 0.01118
## Cumulative Proportion 0.58873 0.60134 0.61357 0.62571 0.63763 0.64925 0.66043
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 1.04793 1.02672 1.01936 1.01364 1.00225 0.99349 0.98594
## Proportion of Variance 0.01098 0.01054 0.01039 0.01027 0.01005 0.00987 0.00972
## Cumulative Proportion 0.67141 0.68195 0.69235 0.70262 0.71267 0.72254 0.73226
## PC43 PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 0.96673 0.96357 0.9485 0.94380 0.92308 0.91570 0.90220
## Proportion of Variance 0.00935 0.00928 0.0090 0.00891 0.00852 0.00839 0.00814
## Cumulative Proportion 0.74160 0.75089 0.7599 0.76879 0.77731 0.78570 0.79384
## PC50 PC51 PC52 PC53 PC54 PC55 PC56
## Standard deviation 0.89163 0.88277 0.86488 0.84941 0.83872 0.83328 0.82304
## Proportion of Variance 0.00795 0.00779 0.00748 0.00721 0.00703 0.00694 0.00677
## Cumulative Proportion 0.80179 0.80958 0.81706 0.82427 0.83131 0.83825 0.84503
## PC57 PC58 PC59 PC60 PC61 PC62 PC63
## Standard deviation 0.81748 0.80746 0.79085 0.77042 0.76650 0.76104 0.7553
## Proportion of Variance 0.00668 0.00652 0.00625 0.00594 0.00588 0.00579 0.0057
## Cumulative Proportion 0.85171 0.85823 0.86448 0.87042 0.87629 0.88209 0.8878
## PC64 PC65 PC66 PC67 PC68 PC69 PC70
## Standard deviation 0.72978 0.7279 0.72353 0.70575 0.69818 0.68517 0.68335
## Proportion of Variance 0.00533 0.0053 0.00523 0.00498 0.00487 0.00469 0.00467
## Cumulative Proportion 0.89312 0.8984 0.90365 0.90863 0.91351 0.91820 0.92287
## PC71 PC72 PC73 PC74 PC75 PC76 PC77
## Standard deviation 0.67501 0.65756 0.63896 0.62114 0.61388 0.60925 0.60217
## Proportion of Variance 0.00456 0.00432 0.00408 0.00386 0.00377 0.00371 0.00363
## Cumulative Proportion 0.92743 0.93175 0.93583 0.93969 0.94346 0.94717 0.95080
## PC78 PC79 PC80 PC81 PC82 PC83 PC84
## Standard deviation 0.58408 0.56742 0.55817 0.54962 0.53610 0.52625 0.52442
## Proportion of Variance 0.00341 0.00322 0.00312 0.00302 0.00287 0.00277 0.00275
## Cumulative Proportion 0.95421 0.95743 0.96054 0.96356 0.96644 0.96921 0.97196
## PC85 PC86 PC87 PC88 PC89 PC90 PC91
## Standard deviation 0.50528 0.49755 0.4798 0.46734 0.45256 0.44882 0.44121
## Proportion of Variance 0.00255 0.00248 0.0023 0.00218 0.00205 0.00201 0.00195
## Cumulative Proportion 0.97451 0.97699 0.9793 0.98147 0.98352 0.98554 0.98748
## PC92 PC93 PC94 PC95 PC96 PC97 PC98
## Standard deviation 0.41786 0.41163 0.39781 0.37758 0.36923 0.36602 0.34913
## Proportion of Variance 0.00175 0.00169 0.00158 0.00143 0.00136 0.00134 0.00122
## Cumulative Proportion 0.98923 0.99092 0.99251 0.99393 0.99529 0.99663 0.99785
## PC99 PC100
## Standard deviation 0.34128 0.31336
## Proportion of Variance 0.00116 0.00098
## Cumulative Proportion 0.99902 1.00000
Question: How much of the genetic variance can be explained by our first two components? 11%
Next we would like to see how population genetic structure relates to geographic space. To this aim, we could display PC maps. A PC map is a spatial interpolation of a particular component. Let’s map PC 1
par(mar=c(4,4,0.5,0.5))
coord <- cbind(sort(c(rnorm(100, -2, 1), rnorm(100, 2, 1))), runif(200))
fit <- fields::Krig(coord, pc$x[,1], m=1)
fields::surface(fit)
points(coord, pch=19)
This map predicts the value of the first principal component at each location in our study area. We observe that the study area is partitioned into two zones that correspond to the 2 clusters visible from PC 1. We have superimposed individual sample sites to see our species distribution.
To check that the PC map is consistent with having 2 islands, we can examine the PC assignment vs the sampling location. Remember that, in the data sets, the 100 first individuals were sampled from island 1 and the last 100 were sampled from island
table((pc$x[,1] > 0) == (1:200 <= 100))
##
## FALSE TRUE
## 1 199
In this example, we found that only one individual was not assigned to its island of origin. Well, this individual might be a migrant from the last generation. These results indicated that a very simple method based on principal component analysis can correctly describe spatial population genetic structure.
Re-run the first analytical steps up to PCA for data simulated with Nm = 10 (a higher value of gene flow), which is stored in datafile “dataNm10.txt.” This was generated with the following ms command:
ms 200 100 -t .5 -I 2 100 100 -ma x 10 10 x > dataNm10.txt
file <- scan(file = system.file("extdata", "dataNm10.txt", package = "LandGenCourse"),
what ="character", sep="\n", skip = 2)
genotype <- NULL
for(locus in 1:100){
res.locus <- file[4:203]
file <- file[-(1:203)]
genotype <- cbind(genotype, as.numeric(as.factor(res.locus)))}
dim(genotype)
## [1] 200 100
pc = prcomp(genotype, scale =T)
par(mar=c(4,4,0.5,0.5))
plot(pc$x, pch = 19, cex = 2, col = rep(c("blue2", "orange"), each = 100))
summary(pc)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.13035 1.65962 1.60797 1.57144 1.56120 1.54312 1.47837
## Proportion of Variance 0.04538 0.02754 0.02586 0.02469 0.02437 0.02381 0.02186
## Cumulative Proportion 0.04538 0.07293 0.09878 0.12348 0.14785 0.17166 0.19352
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.47244 1.46508 1.45386 1.43714 1.40928 1.39436 1.37577
## Proportion of Variance 0.02168 0.02146 0.02114 0.02065 0.01986 0.01944 0.01893
## Cumulative Proportion 0.21520 0.23666 0.25780 0.27845 0.29832 0.31776 0.33669
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 1.35787 1.32736 1.31453 1.29884 1.28979 1.27223 1.26561
## Proportion of Variance 0.01844 0.01762 0.01728 0.01687 0.01664 0.01619 0.01602
## Cumulative Proportion 0.35512 0.37274 0.39002 0.40689 0.42353 0.43971 0.45573
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 1.25975 1.25123 1.24060 1.2286 1.19367 1.18913 1.17116
## Proportion of Variance 0.01587 0.01566 0.01539 0.0151 0.01425 0.01414 0.01372
## Cumulative Proportion 0.47160 0.48726 0.50265 0.5177 0.53199 0.54613 0.55985
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 1.15355 1.14520 1.14136 1.12293 1.11589 1.10717 1.09334
## Proportion of Variance 0.01331 0.01311 0.01303 0.01261 0.01245 0.01226 0.01195
## Cumulative Proportion 0.57315 0.58627 0.59930 0.61191 0.62436 0.63662 0.64857
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 1.07522 1.05704 1.04080 1.02927 1.02555 1.01323 1.01149
## Proportion of Variance 0.01156 0.01117 0.01083 0.01059 0.01052 0.01027 0.01023
## Cumulative Proportion 0.66013 0.67130 0.68214 0.69273 0.70325 0.71352 0.72375
## PC43 PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 0.99667 0.98043 0.97414 0.95534 0.93164 0.9276 0.91014
## Proportion of Variance 0.00993 0.00961 0.00949 0.00913 0.00868 0.0086 0.00828
## Cumulative Proportion 0.73368 0.74329 0.75278 0.76191 0.77059 0.7792 0.78748
## PC50 PC51 PC52 PC53 PC54 PC55 PC56
## Standard deviation 0.90306 0.89196 0.88206 0.87677 0.86650 0.84759 0.84340
## Proportion of Variance 0.00816 0.00796 0.00778 0.00769 0.00751 0.00718 0.00711
## Cumulative Proportion 0.79563 0.80359 0.81137 0.81905 0.82656 0.83375 0.84086
## PC57 PC58 PC59 PC60 PC61 PC62 PC63
## Standard deviation 0.83424 0.82413 0.81034 0.79482 0.7809 0.76755 0.76408
## Proportion of Variance 0.00696 0.00679 0.00657 0.00632 0.0061 0.00589 0.00584
## Cumulative Proportion 0.84782 0.85461 0.86118 0.86750 0.8736 0.87949 0.88532
## PC64 PC65 PC66 PC67 PC68 PC69 PC70
## Standard deviation 0.75591 0.73200 0.7279 0.71543 0.69951 0.68691 0.68006
## Proportion of Variance 0.00571 0.00536 0.0053 0.00512 0.00489 0.00472 0.00462
## Cumulative Proportion 0.89104 0.89640 0.9017 0.90681 0.91171 0.91642 0.92105
## PC71 PC72 PC73 PC74 PC75 PC76 PC77
## Standard deviation 0.66785 0.65373 0.63836 0.6328 0.61995 0.61529 0.59667
## Proportion of Variance 0.00446 0.00427 0.00408 0.0040 0.00384 0.00379 0.00356
## Cumulative Proportion 0.92551 0.92978 0.93386 0.9379 0.94171 0.94549 0.94905
## PC78 PC79 PC80 PC81 PC82 PC83 PC84
## Standard deviation 0.59500 0.58394 0.57004 0.56620 0.55879 0.54500 0.53603
## Proportion of Variance 0.00354 0.00341 0.00325 0.00321 0.00312 0.00297 0.00287
## Cumulative Proportion 0.95259 0.95600 0.95925 0.96246 0.96558 0.96855 0.97142
## PC85 PC86 PC87 PC88 PC89 PC90 PC91
## Standard deviation 0.52661 0.5005 0.49452 0.48761 0.47614 0.45771 0.43933
## Proportion of Variance 0.00277 0.0025 0.00245 0.00238 0.00227 0.00209 0.00193
## Cumulative Proportion 0.97420 0.9767 0.97915 0.98152 0.98379 0.98589 0.98782
## PC92 PC93 PC94 PC95 PC96 PC97 PC98
## Standard deviation 0.43180 0.41775 0.39442 0.38798 0.36503 0.35056 0.32099
## Proportion of Variance 0.00186 0.00175 0.00156 0.00151 0.00133 0.00123 0.00103
## Cumulative Proportion 0.98968 0.99143 0.99298 0.99449 0.99582 0.99705 0.99808
## PC99 PC100
## Standard deviation 0.31378 0.30605
## Proportion of Variance 0.00098 0.00094
## Cumulative Proportion 0.99906 1.00000
par(mar=c(4,4,0.5,0.5))
coord <- cbind(sort(c(rnorm(100, -2, 1), rnorm(100, 2, 1))), runif(200))
fit <- fields::Krig(coord, pc$x[,1], m=1)
fields::surface(fit)
points(coord, pch=19)
table((pc$x[,1] > 0) == (1:200 <= 100))
##
## FALSE TRUE
## 186 14
Question 2: Does PCA provide an accurate description of population genetic structure when the genetic differentiation between the 2 islands is less pronounced? NO
A 2-island model is a relatively simple scenario and unlikely to capture the variation we will see in empirical studies. Will our very basic assignment method based on PCA hold up to more complex scenarios?
Let’s consider a scenario where the 2 populations had been evolving for a long time under an equilibrium island model, and then their environment suddenly changed. Our 2 populations had to move to track their shifting habitat, and after these movements they come into contact in an intermediate region. This contact event resulted in an admixed population with the density of mixed individuals greater in the center of the contact zone than at the ancestral origins at the edges of the landscape.
Using R and our previous simulation, a multi-locus cline that resumes this scenario can be simulated has follows. The source population data are in the file “dataNm1.str.”
First we define a function for the shape of a cline:
# A function for the shape of a cline
sigmoid <- function(x){ 1/(1 + exp(-x))}
p1 <- sigmoid( 0.5 * coord[,1])
Our admixed genotypes are built from a 2 island model with low gene flow (Nm=1)
genotype = read.table(file = system.file("extdata", "dataNm1.str", package = "LandGenCourse"))[,-(1:2)]
admixed.genotype <- matrix(NA, ncol = 100, nrow = 200)
for (i in 1:100){ for (j in 1:100)
admixed.genotype[i,j] = sample( c(genotype[i, j],genotype[i+100, j]), 1, prob = c(p1[i], 1 - p1[i]) )}
for (i in 101:200){ for (j in 1:100)
admixed.genotype[i,j] = sample( c(genotype[i - 100, j],genotype[i, j]), 1, prob = c(p1[i], 1 - p1[i]) )}
res <- data.frame(coord, admixed.genotype)
Now our data set is the R object ‘res.’ The next exercise is to apply PCA to these data to evaluate how geographical genetic variation can be captured by this approach. In comparison with the previous example where we had two geographically discrete populations, we now have a “continuous” population in a landscape. Geographical genetic variation is thus expected to be more gradual than in the previous example. Generate and examine the PC 1 map.
par(mar=c(4,4,0.5,0.5))
pcA = prcomp(admixed.genotype, scale =T)
plot(pcA$x, pch = 19, cex = 2, col = rep(c("blue2","green"), each = 100))
Look at the PC Map:
par(mar=c(4,4,0.5,0.5))
fit <- fields::Krig(res, pcA$x[,1], m=1)
## Warning:
## Grid searches over lambda (nugget and sill variances) with minima at the endpoints:
## (REML) Restricted maximum likelihood
## minimum at right endpoint lambda = 0.05291003 (eff. df= 190 )
fields::surface(fit)
points(res)
Question 3: How does admixture change our prediction of population structure (PCA plot)? Is genomic ancestry correlated with geographical location?
To answer this latter part, check the R2 and significance of statistical association between PC1 component scores and geographical position (p1):
summary(lm(pcA$x[,1]~ p1))
##
## Call:
## lm(formula = pcA$x[, 1] ~ p1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0747 -0.6805 0.0225 0.7729 2.6013
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.7191 0.1853 -20.07 <2e-16 ***
## p1 7.5072 0.3398 22.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.096 on 198 degrees of freedom
## Multiple R-squared: 0.7114, Adjusted R-squared: 0.7099
## F-statistic: 488 on 1 and 198 DF, p-value: < 2.2e-16
The Threespine stickleback (Gasterosteus aculeatus) is a fish that has emerged as a model of rapid and parallel adaptation. Catchen et al. (2013) were interested in how populations colonize freshwater habitats in the Pacific Northwest, USA. These sticklebacks have diversified into three life history forms, one exclusively dwelling in the ocean, another being adapted to freshwater habitats, and one unusual population that can utilize both habitats. It was unclear if this one particular population (Riverbend), from a stream in central Oregon, was introduced due to unintentional human transport, and if this could be an example of rapid adaptation to freshwater from oceanic populations. Single nucleotide polymorphism data were generated using genotyping-by-sequencing, for 9 populations occupying coastal areas and inland streams.
In this tutorial, we will analyze the genetic data generated by Catchen et al. (2013) using a few exploratory methods to quantify and visualize genetic differentiation among the stickleback populations sampled. By the end of this tutorial, hopefully you will be able to make a convincing argument for the regional origin of the recently-introduced inland stickleback population.
data <- read.table(system.file("extdata", "stickleback_data.txt", package = "LandGenCourseData"),
sep="\t", as.is=T, check.names=F)
Create a list of population IDs:
pops <- unique(unlist(lapply(rownames(data),
function(x){y<-c();y<-c(y,unlist(strsplit(x,"_")[[1]][1]))})))
To understand the experimental design a bit better, let’s look at the sample size at each site.
sample_sites <- rep(NA,nrow(data))
for (i in 1:nrow(data)){
sample_sites[i] <- strsplit(rownames(data),"_")[[i]][1]}
N <- unlist(lapply(pops,function(x){length(which(sample_sites==x))}))
names(N) <- pops
N
## cr cs pcr pl sj stl wc rb ms
## 23 97 67 20 86 50 22 138 68
Let’s start examining population structure, first using PCA. We’ll look at the amount of variation explained in the first few components, and then we’ll plot individuals for four components, coloring them by population.
par(mar=c(4,4,2,0.5))
pcaS <- prcomp(data,center=T)
plot(pcaS$sdev^2 / sum(pcaS$sdev^2), xlab="PC",
ylab="Fraction Variation Explained", main="Scree plot")
Get % variance explained for first few PCs:
perc <- round(100*(pcaS$sdev^2 / sum(pcaS$sdev^2))[1:10],2)
names(perc) <- apply(array(seq(1,10,1)), 1, function(x){paste0("PC", x)})
perc
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
## 37.47 3.78 2.55 0.88 0.76 0.44 0.44 0.39 0.37 0.32
Use the RColorBrewer package to select a color palette:
colors <- RColorBrewer::brewer.pal(9, "Paired")
Plot first three PCs with colored symbols:
par(mfrow=c(2,2), mar=c(4,4,0.5,0.5))
plot(pcaS$x[,1:2], col=colors[factor(sample_sites)], pch=16,cex=1.2)
legend("bottomleft", legend=levels(factor(sample_sites)),
col=colors, pch=16, ncol=3, cex=0.8)
plot(pcaS$x[,2:3], col=colors[factor(sample_sites)], pch=16, cex=1.2)
plot(pcaS$x[,3:4], col=colors[factor(sample_sites)], pch=16, cex=1.2)
Question 4:
Do you see evidence of population structure? Yes, the position of the red and orange populations farther away from the green populations suggests strucute is present.
‘STRUCTURE’)Now we are going to use a clustering method to examine population structure. There are many approaches, with various assumptions, and it is important to consider the underlying biology of your research organism (and your dataset size) before choosing an appropriate method.
Here, we will use sparse negative matrix factorization (SNMF) because it is fast to compute for large datasets and it approximates the results of the well-known STRUCTURE algorithm. Notably, it relaxes population genetic assumptions such as Hardy-Weinberg proportions, so it may not converge on the same results as other programs.
We can use SNMF to estimate the number of genetic clusters (K) among our sampled populations. However, this may take a long time. If you want to run the analysis, un-comment the lines by removing the ‘#’ symbol at the beginning of each line.
We use SNMF’s cross-entropy criterion to infer the best estimate of K. The lower the cross-entropy, the better our model accounts for population structure. Sometimes cross-entropy continues to decline, so we might choose K where cross entropy first decreases the most.
#snmf2 <- LEA::snmf(paste0(here::here(), "/data/stickleback.geno"),
# K=1:8, ploidy=2, entropy=T, alpha=100, project="new")
#snmf2 <- LEA::snmf("stickleback.geno", K=1:8, ploidy=2, entropy=T,
# alpha=100, project="new")
#par(mfrow=c(1,1))
#plot(snmf2, col="blue4", cex=1.4, pch=19)
The number of clusters is hard to determine, but four seems to be important and is similar to the results revealed by PCA. I will proceed assuming K=4. We will rerun SNMF using this setting.
K=4
snmf = LEA::snmf(system.file("extdata", "stickleback.geno", package = "LandGenCourseData"),
K = K, alpha = 100, project = "new")
## The project is saved into :
## stickleback.snmfProject
##
## To load the project, use:
## project = load.snmfProject("stickleback.snmfProject")
##
## To remove the project, use:
## remove.snmfProject("stickleback.snmfProject")
##
## [1] "*************************************"
## [1] "* sNMF K = 4 repetition 1 *"
## [1] "*************************************"
## summary of the options:
##
## -n (number of individuals) 571
## -L (number of loci) 10000
## -K (number of ancestral pops) 4
## -x (input file) /Library/Frameworks/R.framework/Versions/4.1/Resources/library/LandGenCourseData/extdata/stickleback.geno
## -q (individual admixture file) /Library/Frameworks/R.framework/Versions/4.1/Resources/library/LandGenCourseData/extdata/stickleback.snmf/K4/run1/stickleback_r1.4.Q
## -g (ancestral frequencies file) /Library/Frameworks/R.framework/Versions/4.1/Resources/library/LandGenCourseData/extdata/stickleback.snmf/K4/run1/stickleback_r1.4.G
## -i (number max of iterations) 200
## -a (regularization parameter) 100
## -s (seed random init) 4607182419176566763
## -e (tolerance error) 1E-05
## -p (number of processes) 1
## - diploid
##
## Read genotype file /Library/Frameworks/R.framework/Versions/4.1/Resources/library/LandGenCourseData/extdata/stickleback.geno: OK.
##
##
## Main algorithm:
## [ ]
## [=============]
## Number of iterations: 36
##
## Least-square error: 722698.111960
## Write individual ancestry coefficient file /Library/Frameworks/R.framework/Versions/4.1/Resources/library/LandGenCourseData/extdata/stickleback.snmf/K4/run1/stickleback_r1.4.Q: OK.
## Write ancestral allele frequency coefficient file /Library/Frameworks/R.framework/Versions/4.1/Resources/library/LandGenCourseData/extdata/stickleback.snmf/K4/run1/stickleback_r1.4.G: OK.
##
## The project is saved into :
## stickleback.snmfProject
##
## To load the project, use:
## project = load.snmfProject("stickleback.snmfProject")
##
## To remove the project, use:
## remove.snmfProject("stickleback.snmfProject")
Create matrix of ancestry proportions:
qmatrix = LEA::Q(snmf, K = K)
Plot results with a barplot similar to that used to represent STRUCTURE results
par(mar=c(4,4,0.5,0.5))
barplot(t(qmatrix), col=RColorBrewer::brewer.pal(9,"Paired"),
border=NA, space=0, xlab="Individuals",
ylab="Admixture coefficients")
#Add population labels to the axis:
for (i in 1:length(pops)){
axis(1, at=median(which(sample_sites==pops[i])), labels=pops[i])}
Import geographical coordinates for the populations:
sites <- read.csv(system.file("extdata", "stickleback_coordinates.csv",
package = "LandGenCourseData"), as.is=T, check.names=F, h=T)
Calculate population average ancestry proportions and create an array with population coordinates:
#initialize array for ancestry proportions:
qpop <- matrix(NA,nrow=length(pops),ncol=K)
#intialize array for coordinates:
coord.pop <- matrix(NA,nrow=length(pops),ncol=2)
index=0
for (i in 1:length(pops)){
if (i==1){
## i.put pop ancestry proportions for each K cluster:
qpop[i,] <- apply(qmatrix[1:N[i],], 2, mean)
#input pop coordinates:
coord.pop[i,1] <- sites[which(sites[,1]==names(N[i])),6]
#input pop coordinates:
coord.pop[i,2] <- sites[which(sites[,1]==names(N[i])),5]
index = index + N[i]
} else {
qpop[i,] <- apply(qmatrix[(index+1):(index+N[i]),], 2, mean)
coord.pop[i,1] <- sites[which(sites[,1]==names(N[i])),6]
coord.pop[i,2] <- sites[which(sites[,1]==names(N[i])),5]
index = index + N[i]
}
}
Create map with pie charts depicting ancestry proportions:
par(mar=c(4,4,0.5,0.5))
plot(coord.pop, xlab = "Longitude", ylab = "Latitude", type = "n")
maps::map(database='state',add = T, col = "grey90", fill = TRUE)
for (i in 1:length(pops)){
mapplots::add.pie(z=qpop[i,], x=coord.pop[i,1],
y=coord.pop[i,2], labels="",
col=RColorBrewer::brewer.pal(K,"Paired"),radius=0.1)
}
Notes are from the Landscape Genetics Concepts Textbook (Balkenhol et al) and the Landscape Genetics online course (DGS 2022).